Morphology-Syntax interface for Turkish LFG
This paper investigates the use of sublexical units as a solution to handling complex morphology with productive derivational processes in the development of a lexical functional grammar for Turkish. Such sublexical units make it possible to expose the internal structure of words with multiple derivations to the grammar rules in a uniform manner. This in turn leads to more succinct and manageable rules. Further, the semantics of the derivations can also be systematically reflected in a compositional way by constructing PRED values on the fly. We illustrate how we use sublexical units for handling simple productive derivational morphology and more interesting cases, such as causativization, that change verb valency. Our priority is to handle several linguistic phenomena in order to observe the effects of our approach on both the c-structure and the f-structure representation, and on grammar writing, leaving the coverage and evaluation issues aside for the moment.
Building a lexical functional grammar for Turkish
Large-scale, deep grammars with structurally rich output are basic resources for complex tools in human-computer interaction and also for exploring the linguistic phenomena of a language. In this thesis, we introduce a large-scale grammar for Turkish implemented in the Lexical Functional Grammar formalism. Developing a large-scale grammar requires that several issues be solved, both linguistically and computationally. As the language to be dealt with is Turkish, rich morphological structures play an important role in constructing the basis of the representation. We follow an approach based on building units that are larger than a morpheme but smaller than a word, in encoding rules of the grammar to explain the linguistic phenomena in a more formal and accurate way. Our implementation covers rules ranging from basic constituents such as adjective, adverbial, or prepositional phrases to more complex types with derivations such as sentential complements, sentential adjuncts, and relative clauses. The noun phrase subgrammar is the core of the system. Other important rules deal with several types of sentence structures, free word order, and coordination. Also, a date-time grammar developed earlier is integrated into our system. Some of the frequently occurring phenomena, such as causatives, passives, noun-verb compounds, and non-canonical objects, are also important from a theoretical perspective. We first examine their linguistic representation and then analyze the details of different types of causatives and non-canonical objects by conducting several tests. We then provide their implementation. To evaluate our grammar, we have experimented with real-world data. Results show that we have a reasonably high coverage in noun phrases (85.5%). We have also integrated our system into a tool called LingBrowser.
Building a wordnet for Turkish
This paper summarizes the development process of a wordnet for Turkish as part of the Balkanet project. After discussing the basic methodological issues that had to be resolved during the course of the project, the paper presents the basic steps of the construction process in chronological order. Two applications using Turkish wordnet are summarized and links to resources for wordnet builders are provided at the end of the paper.
Altsözcüksel birimlerle Türkçe için sözcüksel işlevsel gramer geliştirilmesi (Developing a lexical functional grammar for Turkish with sublexical units)
This paper examines the use of sublexical units as a solution for handling the complex morphological structure and rich derivational phenomena of Turkish, and describes the proposed approach through the Turkish lexical functional grammar being implemented within the Pargram project. With this approach, rules can be written more regularly and succinctly, so that grammar coverage can be extended with rules that are both fewer in number, because the scope for generalization increases, and less complex in content. Moreover, the semantic contributions of derivations to words can be expressed systematically through PRED values created at run time. Our work covers the use of sublexical units with simple derivational suffixes and then turns to relatively more complex linguistic phenomena such as causative constructions. Because our primary aim is to examine the approach across linguistic areas that are as diverse as possible, this paper does not include a quantitative evaluation.
Lexical Normalization for Code-switched Data and its Effect on POS Tagging
Lexical normalization, the translation of non-canonical data to standard
language, has been shown to improve the performance of many natural language
processing tasks on social media. Yet, using multiple languages in one
utterance, also called code-switching (CS), is frequently overlooked by these
normalization systems, despite its common use in social media. In this paper,
we propose three normalization models specifically designed to handle
code-switched data which we evaluate for two language pairs: Indonesian-English
(Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel
normalization layers and their corresponding language ID and POS tags for the
dataset, and evaluate the downstream effect of normalization on POS tagging.
Results show that our CS-tailored normalization models outperform Id-En state
of the art and Tr-De monolingual models, and lead to 5.4% relative performance
increase for POS tagging as compared to unnormalized input.
Treebanking user-generated content: A proposal for a unified representation in universal dependencies
The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.
CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing
Following the development of the universal dependencies (UD) framework and the CoNLL 2017 Shared Task on end-to-end UD parsing, we address the need for a universal representation of morphological analysis which on the one hand can capture a range of different alternative morphological analyses of surface tokens, and on the other hand is compatible with the segmentation and morphological annotation guidelines prescribed for UD treebanks. We propose the CoNLL universal lattices (CoNLL-UL) format, a new annotation format for word lattices that represent morphological analyses, and provide resources that obey this format for a range of typologically different languages. The resources we provide are harmonized with the two-level representation and morphological annotation in their respective UD v2 treebanks, thus enabling research on universal models for morphological and syntactic parsing, in both pipeline and joint settings, and presenting new opportunities in the development of UD resources for low-resource languages.
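
To make the idea of a word lattice over alternative morphological analyses more concrete, the following is a minimal Python sketch. It does not reproduce the actual CoNLL-UL column layout; the edge tuple `(start_state, end_state, form, tag)`, the toy token "ab", its segmentation, and the tags are all assumptions made for illustration. Each path from the start state to the end state corresponds to one candidate analysis of the surface token.

```python
from collections import defaultdict

# Hypothetical lattice for one ambiguous surface token "ab".
# Each edge is (start_state, end_state, form, tag). The token, its
# segmentation and the tags are invented for illustration only and do
# not follow the real CoNLL-UL file format.
EDGES = [
    (0, 2, "ab", "NOUN"),  # analysis 1: a single unsegmented unit
    (0, 1, "a", "ADP"),    # analysis 2: two units, "a" followed by "b"
    (1, 2, "b", "NOUN"),
]


def enumerate_analyses(edges, start, end):
    """Return every path from start to end in an acyclic lattice.

    Each path is a list of (form, tag) pairs, i.e. one alternative
    morphological analysis of the surface token.
    """
    outgoing = defaultdict(list)
    for src, dst, form, tag in edges:
        outgoing[src].append((dst, form, tag))

    def walk(state, prefix):
        if state == end:
            yield list(prefix)
            return
        for dst, form, tag in outgoing[state]:
            yield from walk(dst, prefix + [(form, tag)])

    return list(walk(start, []))


ANALYSES = enumerate_analyses(EDGES, 0, 2)
for analysis in ANALYSES:
    print(" + ".join(f"{form}/{tag}" for form, tag in analysis))
```

In a pipeline setting, a disambiguator would pick one of these paths before parsing; in a joint setting, a parser could score all paths together with the syntactic structure.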